Arabic text summarization based on latent semantic analysis to enhance arabic documents clustering
نویسندگان
چکیده
Arabic Documents Clustering is an important task for obtaining good results with the traditional Information Retrieval (IR) systems especially with the rapid growth of the number of online documents present in Arabic language. Documents clustering aim to automatically group similar documents in one cluster using different similarity/distance measures. This task is often affected by the documents length, useful information on the documents is often accompanied by a large amount of noise, and therefore it is necessary to eliminate this noise while keeping useful information to boost the performance of Documents clustering. In this paper, we propose to evaluate the impact of text summarization using the Latent Semantic Analysis Model on Arabic Documents Clustering in order to solve problems cited above, using five similarity/distance measures: Euclidean Distance, Cosine Similarity, Jaccard Coefficient, Pearson Correlation Coefficient and Averaged Kullback-Leibler Divergence, for two times: without and with stemming. Our experimental results indicate that our proposed approach effectively solves the problems of noisy information and documents length, and thus significantly improve the clustering performance.
منابع مشابه
Arabic News Articles Classification Using Vectorized-Cosine Based on Seed Documents
Besides for its own merits, text classification (TC) has become a cornerstone in many applications. Work presented here is part of and a pre-requisite for a project we have overtaken to create a corpus for the Arabic text process. It is an attempt to create modules automatically that would help speed up the process of classification for any text categorization task. It also serves as a tool for...
متن کاملBig Data Categorization for Arabic Text Using Latent Semantic Indexing and Clustering
Documents categorization is an important field in the area of natural language processing. In this paper, we propose using Latent Semantic Indexing (LSI), singular value decomposing (SVD) method, and clustering techniques to group similar unlabeled document into pre-specified number of topics. The generated groups are then categorized using a suitable label. For clustering, we used Expectation–...
متن کاملAutomatic Multi-Document Arabic Text Summarization Using Clustering and Keyphrase Extraction
Automatic text summarization has become important due to the rapid growth of information texts since it is very difficult for human beings to manually summarize large documents of texts. A full understanding of the document is essential to form an ideal summary. However, achieving full understanding is either difficult or impossible for computers. Therefore, selecting important sentences from t...
متن کاملMulti-Document Arabic Summarization Using Text Clustering to Reduce Redundancy
“The process of multi-document summarization is producing a single summary of a collection of related documents. In this work we focus on generic extractive Arabic multi-document summarizers. We also describe the cluster approach for multi-document summarization. The problem with multi-document text summarization is redundancy of sentences, and thus, redundancy must be eliminated to ensure cohe...
متن کاملThe CIST Summarization System at TAC 2011
In this report, we present our extractive summarization system on both summarization and multiling tracks of TAC 2011. We introduce an extractive multi-document summarization method based on hierarchical topic model of hierarchical Latent Dirichlet Allocation (hLDA) and sentence compression. hLDA is a representative generative probabilistic model, which not only can mine latent topics from a la...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
ورودعنوان ژورنال:
- CoRR
دوره abs/1302.1612 شماره
صفحات -
تاریخ انتشار 2013